The Data Science Workflow II – Visualize
Charlotte Fresenius Privatuniversität
April 4, 2025
Anscombe's quartet is a collection of four data sets, each with 11 observations of the variables x and y. Here are the first two rows of each:
   data_id  x     y
1        1 10  8.04
2        1  8  6.95
12       2 10  9.14
13       2  8  8.14
24       3  8  6.77
25       3 13 12.74
36       4  8  7.71
37       4  8  8.84
All four data sets have (almost exactly) the same mean and standard deviation in x and y, and the same correlation between the two variables:
  data_id mean_x  sd_x mean_y  sd_y   cor
1       1      9 3.317  7.501 2.032 0.816
2       2      9 3.317  7.501 2.032 0.816
3       3      9 3.317  7.500 2.030 0.816
4       4      9 3.317  7.501 2.031 0.817
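These summary statistics can be reproduced with a short sketch. It assumes R's built-in anscombe data frame, which stores the quartet in wide format (columns x1 to x4 and y1 to y4) rather than the long format with data_id shown above. The helper name summarise_set is our own.

```r
# Assumption: use the built-in `anscombe` data (wide format: x1..x4, y1..y4)
summarise_set <- function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(data_id = i,
             mean_x = mean(x), sd_x = sd(x),
             mean_y = mean(y), sd_y = sd(y),
             cor = cor(x, y))
}

# One row of summary statistics per data set
anscombe_stats <- do.call(rbind, lapply(1:4, summarise_set))
print(round(anscombe_stats, 3))
```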
So, seems like these data sets should look pretty similar, right?
“The simple graph has brought more information to the data analyst’s mind than any other device.”
John Tukey (Statistician, 1915 - 2000)
The gg in ggplot2 stands for the grammar of graphics, a coherent system for describing and building graphs developed by statistician Leland Wilkinson in his aptly named 2005 book.
The grammar is a set of rules for producing graphics from data: it takes pieces of data and maps them to geometric objects (like points and lines) that have aesthetic attributes (like position, colour and size), together with further rules for transforming the data if needed, adjusting scales, and adapting the coordinate system and themes.
Note: a grammar limits the structure of what you can say, but it does not automatically make what you say meaningful, i.e. your code just being “grammatically” correct does not make the resulting plot sensible.
Let’s see an example of how this works with ggplot2 in practice.
The example uses the palmerpenguins data set, which can be obtained from CRAN in a package that bears exactly that name.
Make sure the required packages palmerpenguins and ggplot2 are installed. Then we can load them and start by inspecting the data set:
  species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1  Adelie Torgersen           39.1          18.7               181        3750
2  Adelie Torgersen           39.5          17.4               186        3800
3  Adelie Torgersen           40.3          18.0               195        3250
4  Adelie Torgersen             NA            NA                NA          NA
5  Adelie Torgersen           36.7          19.3               193        3450
6  Adelie Torgersen           39.3          20.6               190        3650
     sex year
1   male 2007
2 female 2007
3 female 2007
4   <NA> 2007
5 female 2007
6   male 2007
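The output above can be reproduced with something like the following sketch. Converting to a plain data frame is an assumption on our part, made because the printout has the look of a data frame rather than a tibble. The name penguins_df is ours.

```r
# install.packages(c("palmerpenguins", "ggplot2"))  # run once if needed
library(palmerpenguins)
library(ggplot2)

# Assumption: convert the tibble to a plain data frame for printing
penguins_df <- as.data.frame(penguins)
head(penguins_df)
```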
Artwork by @allison_horst
As zoologists, we might be interested in the following questions about the different types of Palmer penguins:
Let’s answer all of these questions using a single visualization.
Ultimately, we want to create the following plot:
Let’s recreate this plot step-by-step.
With ggplot2, we begin a plot with the function ggplot. Its first argument is the dataset to use in the graph.
For now, we have not told ggplot how to visualize the data, so we only have an empty canvas. Now, we will “paint” onto this canvas in layers.
The second argument of ggplot is called mapping. It defines how variables in our data set are mapped to visual properties (aesthetics) of our plot. This argument is always defined in the aes function, and the x and y arguments of aes specify which variables to map to the x- and y-axes. We want flipper length on the x and body mass on the y axis.
Now, our empty canvas has more structure: x- and y-axis have a range, ticks and labels. But the penguins themselves are not yet on the plot.
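The first two steps described above can be sketched as follows. The object names p_canvas and p_axes are ours, chosen for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Step 1: only the data set is given, so we get an empty canvas
p_canvas <- ggplot(data = penguins)

# Step 2: the mapping gives the axes a range, ticks and labels
p_axes <- ggplot(data = penguins,
                 mapping = aes(x = flipper_length_mm, y = body_mass_g))
p_axes
```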
This is because we have not yet articulated, in our code, how to represent the observations from our data frame on our plot. To do so, we need a geom, a geometrical object used for data representation. In the example, we want our data represented by points, so we use geom_point:
Now, we start to have an actual scatter plot. Note that R warns us about missing data points. We will suppress this warning in the plots to come.
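A minimal sketch of this step, with the point geom layered onto the canvas (the name p_points is ours):

```r
library(ggplot2)
library(palmerpenguins)

# Layer a point geom onto the canvas with `+`
p_points <- ggplot(data = penguins,
                   mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()
p_points  # printing draws the scatter plot
```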
In ggplot2, geometric objects are made available with functions that start with geom_. People often describe plots by the type of geom that the plot uses, for example:
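- Scatter plots use point geoms (geom_point), as we just saw.
- Bar charts use bar geoms (geom_bar).
- Line charts use line geoms (geom_line).
- Boxplots use boxplot geoms (geom_boxplot).
We layer a geom onto a ggplot literally in an additive way, i.e. by combining the two with a + sign. We will learn how to deal with many different geoms in this way.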
Does the relationship between flipper length and body mass differ by species? To answer this, let’s represent species with different coloured points. To achieve this, we modify the aesthetic to additionally map species to colour:
In mapping a categorical variable to the colour aesthetic, ggplot2 automatically assigns a unique colour to each factor level (i.e. each species), a process known as scaling. ggplot2 will also add a legend that explains which values correspond to which levels.
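Mapping species to colour at the top level could look like this (a sketch; the name p_colour is ours):

```r
library(ggplot2)
library(palmerpenguins)

# color = species is mapped globally, in the call to ggplot()
p_colour <- ggplot(data = penguins,
                   mapping = aes(x = flipper_length_mm, y = body_mass_g,
                                 color = species)) +
  geom_point()
p_colour
```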
Now, let’s add one more layer, a smooth curve displaying the relationship between body mass and flipper length. Since this is a new geometric object representing our data, we will add a new geom, namely geom_smooth based on a linear model (lm):
We have successfully added lines, but this plot does not look like our ultimate goal. Instead of having only one line for the entire data set, we have separate lines for each of the three penguin species. What went wrong?
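The plot with one line per species is what we would get from a sketch like the following, where colour is still mapped at the top level (the name p_three_lines is ours):

```r
library(ggplot2)
library(palmerpenguins)

# The global color mapping is inherited by BOTH geoms,
# so geom_smooth fits one line per species
p_three_lines <- ggplot(data = penguins,
                        mapping = aes(x = flipper_length_mm, y = body_mass_g,
                                      color = species)) +
  geom_point() +
  geom_smooth(method = "lm")
p_three_lines
```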
When aesthetics are defined in ggplot(), at the top level, they’re passed down to each of the subsequent geom layers of the plot. However, each geom function can also take a mapping argument, which allows for aesthetics at the local level. Here, we want point colours, but not lines separated by species, so we should specify color = species locally for geom_point only:
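A sketch of the local mapping just described (the name p_one_line is ours):

```r
library(ggplot2)
library(palmerpenguins)

# color = species now lives in geom_point only;
# geom_smooth sees no grouping and fits a single line
p_one_line <- ggplot(data = penguins,
                     mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth(method = "lm")
p_one_line
```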
We are getting close to our ultimate goal… Some minor details are still missing though.
Due to differences in colour perception (e.g. colour blindness), it is generally not a good idea to represent information using only colours on a plot. Therefore, in addition to colour, we can also map species to the shape aesthetic:
Note that the legend is automatically updated to reflect the different shapes of the points as well.
The “clean-up” work involves setting appropriate labels for title, subtitle, axes, and legend using the labs function and using the colourblind safe colour palette scale_color_colorblind from the ggthemes package (an extension of ggplot2):
library(ggthemes)
ggplot(data = penguins,
       mapping = aes(x = flipper_length_mm,
                     y = body_mass_g)) +
  geom_point(mapping = aes(color = species,
                           shape = species)) +
  geom_smooth(method = "lm") +
  labs(x = "Flipper length (mm)",
       y = "Body mass (g)",
       title = "Body mass and flipper length",
       subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
       color = "Species",
       shape = "Species") +
  scale_color_colorblind()
Now we are done: we have built our target plot step-by-step from the ground up. The example highlights the core principles of how ggplot2 allows us to build nice-looking, informative graphs with relatively little coding effort.
Conceptually, the steps we went through are the same for every plot created with ggplot2. Based on a tidy data set, we always proceed as in the example: we tell ggplot about our data set, map variables to aesthetics, add geoms as layers, and finally clean up labels and scales.
These steps are always the same. We will now only learn in greater detail how to tell ggplot what to do by going through several examples of different data sets, aesthetic mappings, geoms and more. As with anything, the key to improvement is LOTS of trial and error…
Let’s say we want to visualize the distribution of body mass (in grams) across all penguins in the sample using a histogram. For this, we map the variable body_mass_g to x and use geom_histogram:
A histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin. ggplot2 by default creates 30 such bins. We can control this by setting bins ourselves:
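A sketch of both variants. The value bins = 20 is our own choice for illustration, as are the object names.

```r
library(ggplot2)
library(palmerpenguins)

# Default histogram: 30 bins
p_hist_default <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram()

# Same histogram with the number of bins set by hand (20 is an assumption)
p_hist_bins <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(bins = 20)
p_hist_bins
```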
We do not like the colour of this histogram. Previously, we set the color aesthetic, but here we do not want to map colour to a variable. Instead we want to set it to a specific colour. This we do in the geom, not in the aes:
Hm, this is not what we wanted… ggplot only coloured the border line of the bars. Turns out that if we want to fill the bars in a specific colour, we need to use the fill argument of the geom_histogram function:
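The difference between the two arguments can be sketched like this. The colour deepskyblue4 and bins = 20 are assumptions for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# color set in the geom (not in aes): only the bar borders change
p_border <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(bins = 20, color = "deepskyblue4")

# fill set in the geom: the bars themselves are coloured
p_filled <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(bins = 20, fill = "deepskyblue4")
p_filled
```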
An alternative visualization for distributions of numerical variables is a density plot, which is a smoothed-out version of a histogram. To create one, we simply switch out our geom to geom_density:
That’s a bit boring, let’s change the colour again (both of the line itself using color and the area below the curve using fill):
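A density plot with both colours set might look as follows. The two colour names are our own picks.

```r
library(ggplot2)
library(palmerpenguins)

# color styles the density line, fill the area below the curve
p_density <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_density(color = "deepskyblue4", fill = "lightblue")
p_density
```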
A simple bar chart visualizes the (absolute or relative) frequencies of the levels of a categorical variable. While we could use ggplot directly on the data, it is generally preferable to compute frequencies explicitly ourselves, e.g. for species:
species_freq <- as.data.frame(table(penguins$species))
names(species_freq)[1] <- "species"
# As we have already computed frequencies, we have to tell the bar geom
# that it should use the values as they are, i.e. use the "identity":
ggplot(species_freq, aes(x = species, y = Freq)) +
  geom_bar(stat = "identity", fill = "deepskyblue4")
If the variable has a nominal (and not ordinal) scale, then it is usually preferable to order the bars based on their frequency. For this, we have to reorder the factor levels for the species:
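One way to reorder the levels is base R's reorder function. This is a sketch; sorting from most to least frequent is our assumption.

```r
library(ggplot2)
library(palmerpenguins)

species_freq <- as.data.frame(table(penguins$species))
names(species_freq)[1] <- "species"

# Reorder the factor levels by decreasing frequency
species_freq$species <- reorder(species_freq$species, -species_freq$Freq)

p_ordered <- ggplot(species_freq, aes(x = species, y = Freq)) +
  geom_bar(stat = "identity", fill = "deepskyblue4")
p_ordered
```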
Often, we want the bars to be horizontal rather than vertical. To achieve this, we simply flip the axes using coord_flip. Additionally, we add nice axis labels:
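A sketch of the horizontal version. The axis label texts are our own wording.

```r
library(ggplot2)
library(palmerpenguins)

species_freq <- as.data.frame(table(penguins$species))
names(species_freq)[1] <- "species"

# coord_flip swaps the x and y axes, giving horizontal bars
p_horizontal <- ggplot(species_freq, aes(x = species, y = Freq)) +
  geom_bar(stat = "identity", fill = "deepskyblue4") +
  coord_flip() +
  labs(x = "Species", y = "Number of penguins")
p_horizontal
```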
To visualize the relationship between a categorical and a numerical variable, we can create parallel boxplots. For example, for the distribution of body weight by species, we could try:
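Parallel boxplots of body mass by species can be sketched as:

```r
library(ggplot2)
library(palmerpenguins)

# Categorical variable on x, numerical variable on y
p_box <- ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_boxplot()
p_box
```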
An alternative to boxplots are so-called violin plots. In their simplest form, these are rotated density plots of the numerical variable in each category mirrored to create a symmetric, “violin-like” object:
Since density plots can also mask certain aspects of the distribution of the data, it is often advisable to additionally plot the actual data points on top of the violin plots. We can do this by layering a point geom on top:
If we have a lot of points close together, then they overlap and it becomes impossible to tell their distribution. To avoid this, we can slightly perturb the points in the x direction. This is achieved by switching to geom_jitter.
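The violin plot with jittered points on top might be built like this. The jitter width of 0.2 is an assumption.

```r
library(ggplot2)
library(palmerpenguins)

# Violins first, then the raw data points, slightly perturbed in x
p_violin <- ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_violin() +
  geom_jitter(width = 0.2)
p_violin
```

Replacing geom_jitter with geom_point gives the version with overlapping points.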
Another alternative to displaying the distribution of body weights in the three penguin species is to combine their densities into one plot in the usual way, i.e. with the numerical variable on the x axis.
Now, the densities are overlapping. To avoid this, we could of course not map the fill aesthetic. A nice alternative is to set the alpha aesthetic, which controls transparency. 0 is max transparency, 1 is max opaqueness (default).
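A sketch of the overlapping densities with transparency. The value alpha = 0.5 is our choice.

```r
library(ggplot2)
library(palmerpenguins)

# fill is mapped to species; alpha is set to a fixed value in the geom
p_dens_species <- ggplot(penguins, aes(x = body_mass_g, fill = species)) +
  geom_density(alpha = 0.5)
p_dens_species
```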
We can use grouped or stacked bar plots to visualize the relationship between two categorical variables. For example, we might want to see how often the different species occur on the different islands.
We can create variations of this plot by changing the position argument of geom_bar. The default is stack. Most useful for comparing distributions are relative frequency plots achieved with position = "fill":
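The two variants can be sketched as follows. Putting island on the x axis and species in the fill is an assumption; the roles could equally be swapped.

```r
library(ggplot2)
library(palmerpenguins)

# Default: stacked bars of absolute frequencies
p_stacked <- ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar()

# Relative frequencies: every bar is scaled to height 1
p_relative <- ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(position = "fill")
p_relative
```

Using position = "dodge" instead places the bars side by side, giving a grouped bar plot.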
In the introductory example, we have already seen scatter plots and smooth curves as ways to illustrate the relationship between two numerical variables. Scatter plots are an absolute staple in a data scientist’s toolbox:
The introductory example displayed three variables: flipper_length_mm (x), body_mass_g (y) and species (color and shape). In general, we have the following options to introduce additional variables into a scatter plot:
- color (categorical or numerical variables)
- shape (categorical)
- size (categorical or numerical variables)
- alpha (categorical or numerical variables)
In the default setting, the points are quite small, which makes it hard to distinguish the shapes. To mitigate this, we additionally set the size aesthetic to something larger, namely 3.
Let’s say we wanted a scatter plot of penguin bill length vs. bill depth for different species. Additionally, we want information on body mass included as well. A logical choice to represent the latter would be size:
If we now really wanted to go mad, we could add the island back in via the shape aesthetic and additionally include the flipper length via the alpha aesthetic, so that longer flippers are represented by more opaque points:
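A sketch of this six-variable scatter plot, assuming that species is mapped to colour as in the earlier examples:

```r
library(ggplot2)
library(palmerpenguins)

# Six variables in one plot: x, y, color, size, shape and alpha
p_mad <- ggplot(penguins,
                aes(x = bill_length_mm, y = bill_depth_mm,
                    color = species, size = body_mass_g,
                    shape = island, alpha = flipper_length_mm)) +
  geom_point()
p_mad
```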
To facet a plot by a single variable, we use facet_wrap. It takes a so-called formula as its first argument, which we create with ~ followed by the name of a categorical variable.
To facet a plot by two variables, we use facet_grid. We specify the variable to be faceted in the rows before the ~ and the variable to be faceted in the columns after it. For example, we could additionally facet sex in the rows:
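Both kinds of faceting can be sketched as follows. Using island as the first faceting variable is an assumption on our part.

```r
library(ggplot2)
library(palmerpenguins)

base_plot <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

# One panel per island
p_wrap <- base_plot + facet_wrap(~ island)

# sex in the rows, island in the columns
p_grid <- base_plot + facet_grid(sex ~ island)
p_grid
```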
For some data sets, we can also display an additional variable in a scatter plot by creating informative labels for the points.
Our penguin data set is not a good example to illustrate this, so we consider the elections_historic data set from the socviz package instead:
library(socviz)
elections_historic <- as.data.frame(elections_historic)
head(elections_historic[, c("year", "winner", "win_party", "ec_pct",
                            "popular_pct", "winner_label")])
  year                 winner win_party ec_pct popular_pct  winner_label
1 1824      John Quincy Adams     D.-R. 0.3218      0.3092    Adams 1824
2 1828         Andrew Jackson      Dem. 0.6820      0.5593  Jackson 1828
3 1832         Andrew Jackson      Dem. 0.7657      0.5474  Jackson 1832
4 1836       Martin Van Buren      Dem. 0.5782      0.5079    Buren 1836
5 1840 William Henry Harrison      Whig 0.7959      0.5287 Harrison 1840
6 1844             James Polk      Dem. 0.6182      0.4954     Polk 1844
This data set contains information on US presidential elections from 1824 to 2016. We will have a look at the winner’s share of the electoral college vote and the popular vote.
In a scatter plot, we want to have the popular vote share on the x axis and the electoral college vote share on the y axis. If we only map these two aesthetics, the plot looks quite uninformative…
We do not know who any of these points represent. This is a perfect use case for labeling the points. For this, we can use the label aesthetic together with the text geom (there is also a label geom, which would work as well):
That is much better already. However, now, there is a lot of overlap. To remedy this, we need an additional library called ggrepel, which provides us with a text_repel geom to avoid overlapping labels:
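The three stages (plain points, text labels, repelled labels) can be sketched as follows, assuming the packages socviz and ggrepel are installed. The object names are ours.

```r
library(ggplot2)
library(socviz)
library(ggrepel)

# Stage 1: points only, which is quite uninformative
p_plain <- ggplot(elections_historic, aes(x = popular_pct, y = ec_pct)) +
  geom_point()

# Stage 2: map winner_label to the label aesthetic and add a text geom
p_text <- ggplot(elections_historic,
                 aes(x = popular_pct, y = ec_pct, label = winner_label)) +
  geom_point() +
  geom_text()

# Stage 3: let ggrepel push overlapping labels apart
p_repel <- ggplot(elections_historic,
                  aes(x = popular_pct, y = ec_pct, label = winner_label)) +
  geom_point() +
  geom_text_repel()
p_repel
```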
… ggplot uses by default.
… x and / or y axis.
In ggplot2, virtually every aspect of a plot is customizable. We will only have a look at how to customize some of the most important ones, using the scale_ functions, the guides function and the theme function.
Let’s come back to an earlier example from the penguin data set:
This plot has four aesthetic mappings:
- flipper_length_mm is mapped to x.
- body_mass_g is mapped to y.
- species is mapped to color.
- island is mapped to shape.
And each of these mappings has a scale:
- flipper_length_mm is a continuous variable, so the x scale is continuous.
- body_mass_g is a continuous variable, so the y scale is continuous.
- species is an unordered categorical variable, so the color scale is discrete.
- island is an unordered categorical variable, so the shape scale is discrete.
Scales for these mappings may have labels, axis tick marks at particular positions, or specific colours or shapes. If we want to adjust them, we use one of the scale_ functions.
These functions have the following general structure:
So, for example:
- scale_x_continuous controls x scales for continuous variables
- scale_y_continuous controls y scales for continuous variables
- scale_color_discrete controls color scales for discrete variables
- scale_shape_discrete controls shape scales for discrete variables
Let’s see this in action. Say we wanted to override the default for tick marks by having a mark every 5 mm on the x axis and every 500 g on the y axis. We can use the breaks argument to control the position of the tick marks:
Now, let’s say we also wanted to change the colours and shapes used to represent species and island, respectively. For manually adapting them, we use scale_color_manual and scale_shape_manual:
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g,
                     color = species, shape = island)) +
  geom_point(size = 3) +
  scale_x_continuous(breaks = seq(170, 230, by = 5)) +
  scale_y_continuous(breaks = seq(3000, 6000, by = 500)) +
  scale_color_manual(values = c("red", "blue", "green")) +
  scale_shape_manual(values = c(17, 18, 8))
The following shapes are most commonly used in R:
You can reference them by using the numbers indicated and passing them to scale_shape_manual in the order of the factor levels.
In the example before, we had Biscoe = 17, Dream = 18 and Torgersen = 8.
When we map a variable to color or fill as an aesthetic, ggplot2 uses default colour or fill scales that do a fine enough job for exploratory data analysis, but typically have to be tweaked for publication-ready plots.
In the previous example, we changed the colours manually using scale_color_manual and the names of three colours (red, blue and green).
Palettes that are readily available in ggplot2 are the colorbrewer palettes. For discrete variables, they come in three different types: sequential, diverging and qualitative.
We can use all of these palettes in our plots by passing their names to the functions scale_color_brewer and scale_fill_brewer, respectively.
In our example, we have the species variable mapped to colour, which is an unordered categorical variable. A suitable palette to represent it might therefore be Dark2, for example.
To do this, we would run the following code (result on the next slide):
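A sketch of such a call, based on the earlier scatter plot with species mapped to colour and island to shape:

```r
library(ggplot2)
library(palmerpenguins)

# Dark2 is a qualitative colorbrewer palette, passed by name
p_brewer <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g,
                                 color = species, shape = island)) +
  geom_point(size = 3) +
  scale_color_brewer(palette = "Dark2")
p_brewer
```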
With these colorbrewer palettes, we can represent data from discrete variables, like categorical or count data.
However, what if we want to represent a continuous variable using colour, such as penguin bill length?
All sequential and diverging colorbrewer palettes can also be used for continuous scales. Depending on the aesthetic, we only have to pass them into the functions scale_color_distiller or scale_fill_distiller.
Penguin bill length does not have a neutral midpoint, so we use a sequential palette like YlOrRd to represent it (result on the next slide):
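A sketch with bill length mapped to colour. The choice of x and y variables is taken over from the earlier examples and is an assumption here.

```r
library(ggplot2)
library(palmerpenguins)

# A continuous variable on the colour scale, using a sequential palette
p_distiller <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g,
                                    color = bill_length_mm)) +
  geom_point(size = 3) +
  scale_color_distiller(palette = "YlOrRd")
p_distiller
```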
The options for colour choice are virtually endless. The goal here was merely to introduce colour scales and how to change them.
If you find yourself unhappy with the colour palettes provided by colorbrewer, have a look at the Palette Finder in the R Graph Gallery:
Now let’s discuss guides. For this, consider again our violin plot from earlier:
This plot contains redundant information: the species can be inferred from the axis labels, we do not need the additional legend for species.
To “switch off” the guide for a particular aesthetic, we can simply call the guides function with the name of the aesthetic as a named argument set to "none":
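A sketch of this, assuming the violins were coloured by mapping species to the fill aesthetic:

```r
library(ggplot2)
library(palmerpenguins)

# The fill legend repeats the x axis labels, so we switch it off
p_no_legend <- ggplot(penguins, aes(x = species, y = body_mass_g,
                                    fill = species)) +
  geom_violin() +
  guides(fill = "none")
p_no_legend
```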
The guides function can do much more than simply “switch off” legends for individual scales. In ggplot2, this bridge is best crossed when you first come to it…
Finally, a very important domain of plot customization is opened up by the theme function. It allows us to customize all aspects that are not directly related to the data being displayed.
A very common basic use of the theme function is to place the legend at a different location in the plot and to change font sizes of different labels (result on the next slide):
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g,
                     color = species, shape = island)) +
  geom_point(size = 3) +
  labs(title = "Body mass and flipper length",
       subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
       caption = "Source: palmerpenguins package",
       color = "Species",
       shape = "Island") +
  theme(legend.position = "bottom",
        plot.title = element_text(size = rel(1.75)),
        plot.subtitle = element_text(size = rel(1.5)),
        plot.caption = element_text(size = rel(0.75)))
While the theme function gives us fine-grained control over a plot in ggplot2, it is also incredibly cumbersome and usually not necessary.
The complete themes built into ggplot2 are: bw, classic, dark, gray (default), light, linedraw, minimal, test and void.
Beyond these, there is the ggthemes package that has plenty more to offer.
Now that we know how to create and customize our own plots, we might want to save them to include them in a presentation or send them to a colleague.
The easiest way to do this is to use the function ggsave. By default, it will save the most recently created plot into the file name you provide:
You can save the plot as PDF, JPG, PNG (or several other formats) by changing the file ending accordingly.
You can also change the dimensions and the resolution of the plot by changing arguments width, height and dpi, respectively.
As always, for more details, refer to the documentation via ?ggsave.
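A minimal sketch of saving a plot. The file name and the dimensions are placeholders of our own, and the file is written to a temporary directory.

```r
library(ggplot2)
library(palmerpenguins)

p_save <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

# Hypothetical output path; the file ending determines the format
out_file <- file.path(tempdir(), "penguins.png")

# width and height are in inches by default, dpi sets the resolution
ggsave(out_file, plot = p_save, width = 8, height = 5, dpi = 300)
```

Passing the plot explicitly via the plot argument avoids relying on the most recently created plot.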
Data Science and Data Analytics – The Data Science Workflow II